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Abstract 

A comparison of PGI OpenACC, FORTRAN CUDA, and Nvidia CUDA 
pseudospectral methods on a single GPU and GCC FORTRAN on single 
and multiple CPU cores is reported. The GPU implementations use CuFFT 
and the CPU implementations use FFTW. Porting pre-existing FORTRAN 
codes to utilize a GPUs is efficient and easy to implement with OpenACC 
and CUDA FORTRAN. Example programs are provided. 

1 Introduction 

Graphics processing units (GPUs) can have better performance for mathematical 
operations on large arrays when compared to traditional central processing units 
(CPUs). The fast Fourier transform (FFT) is one application for which GPUs have 
a significant performance advantage over CPUs. The performance advantage can 
be significant for simulations which fit within the memory constraints of a single 
GPU. 

GPU acceleration has largely been accomplished in variants of C with vari- 
ants of FORTRAN being a recent addition. A comparison of performance of the 
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lesser known OpenACC and CUDA FORTRAN on GPUs is of interest because a 
large number of legacy codes use FORTRAN, flU ll20ll and because [10, Chp. 6] 
indicates that on CPUs, FORTRAN compilers typically generate more efficient 
scientific computing codes than C compilers. Previous works which have ex- 
amined speedup offered by single GPUs in solving differential equations using 
Fourier transforms are Q, (H and 11211 . 

In this paper, extensions of FORTRAN for GPUs when solving nonlinear 
PDEs with pseudo spectral methods are compared.The user friendliness of these 
different GPU programming models is described. A benchmark suite of algo- 
rithms for three different nonlinear PDEs with Fourier pseudospectral methods is 
also provided. 

2 Programming Models 

At present it is unclear what programming model will be widely adopted for ac- 
celerators. Some current options include OpenCL, OpenACC, OpenHMPP, F2C- 
ACC, CUDA, and CUDA FORTRAN. This note will compare CUDA C, CUDA 
FORTRAN, and OpenACC GPU programs to serial and OpenMP[ 13 ] FORTRAN 
CPU programs. 

2.1 FORTRAN and OpenACC 

OpenACC is a standard currently under development to allow for parallel pro- 
gramming of CPUs and GPUs lfT2l . It primarily uses directives to move data be- 
tween a host CPU and a GPU and parallelize operations on the GPU. FORTRAN, 
C, and C++ are supported and there is a hope that it will be merged with OpenMP 
in the future to allow for a unified interface to GPUs and other accelerators. 

2.2 CUDA C 

CUDA C is a set of extensions to C that allow special functions, called kernels, 
to be executed on supported NVIDIA GPUs[l lj. CUDA C provides a lower level 
interface than many of the other application interfaces discussed here. 
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2.3 CUDA FORTRAN 



CUDA FORTRAN, is a set of extensions to FORTRAN that allows kernels to 
execute on NVIDIA GPUsL14J. The main constructs are similar to CUDA C; 
however, CUDA FORTRAN has several directives that can automatically generate 
kernels for common cases. CUDA FORTRAN provides both a high level interface 
similar to OpenACC and a low level interface like CUDA C. 

3 An Overview of The Equations 

The three choosen equations, which are of mathematical and physical interest, can 
be solved entirely using Fourier pseudospectral methods and simple representative 
timestepping schemes. The resulting programs are short, can be understood in 
their entirety by a single person and only require an FFT routine. 

3.1 The Cubic Nonlinear Schrodinger Equation 

The focusing two dimensional cubic nonlinear Schrodinger equation is 

ilpt + Ipxx + ^yy = MV> (1) 

where ip(x, y, t) is a complex valued function of time, t, and two spatial variables, 
x and y. This equation arises in a variety of contexts including quantum me- 
chanics, in simplified models for lasers, and water waves. When ip has periodic 
boundary conditions, the cubic Schrodinger equation has two conserved quanti- 
ties, 

\i/;\ 2 dxdy and JJ^L-^dxdy, (2) 

known as the mass and energy respectively; they can be used to assess the accu- 
racy of a numerical solution. For more background on this equation, see [fT6l and 

ED- 

3.2 The Sine-Gordon Equation 

The 2D sine-Gordon equation, 

u t t — Au = — sin-u, (3) 
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arises in many different applications, including propagation of magnetic flux on 
Josephson junctions, sound propagation in a crystal lattice, and several others 
discussed in [15]. It has a conserved Hamiltonian 

H = J J i (u 2 t + |Vw| 2 ) + (1 - cosw) dxdy, (4) 

which is useful in evaluating the accuracy of numerical solutions. 

3.3 The 2D Navier-Stokes Equation 

The 2D incompressible Navier-Stokes equations in stream function (^)-vorticity 
(to) form are 

uj t + ipyUx — ipx^y = Au and Aip = —to. (5) 

These equations model fluid flow, and further background on them can be found in 
[fT8ll among other references. The Taylor-Green vortex solution of these equations 
is 

uj(x, y, t) = An sin(27ra;) sin(27n/) exp(— 87r 2 t). (6) 

4 Fourier Pseudospectral Methods 

Fourier pseudospectral methods are a class of numerical methods to solve partial 
differential equations that utilize the Fourier transform. These methods became 
popular after the publication of 0; other expositions are in 0, 0, (H, J51, lfT6l . 
[17J and [19]. These methods utilize the fact that differentiation of a function is a 
simple and fast multiplication by the wave number in Fourier space. The nonlinear 
terms in are computed in real space. This section gives a brief description of 
the second order in time algorithms used. The programs are available at jhttp :"] 
|//arxiv. org /abs/1206. 3215] 

4.1 A Numerical Method for the Nonlinear Schrodinger Equa- 
tion 

The nonlinear Schrodinger equation is approximated by splitting it into two equa- 
tions which can be solved exactly, lfT6l and lTT9l . The Fourier transform of ip will 
be denoted by ip. In Fourier space one first solves 

l^t + ^Pxx +^yy=0 (7) 
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for half a time step, 0.58t. Letting k x and k y denote the wave numbers in the x 
and y directions, eq. {7]) is solved by ip(t = 0.55t) = exp[— i(kl + ky)8t]^(t = 0). 
The solution of 

#t = (8) 

for a full time step is ip(t = 8t) = exp(i\ip(t = 0)\ 2 5t)ip(t = 0), since \ip\ 2 is 
conserved. Finally eq. Q is solved for another half time step to get the solution a 
full time step later. 

4.2 A Numerical Method for the Sine-Gordon Equation 

Following 01, the second derivative in time is approximated by a central differ- 
ence. The resulting numerical method is 

u n+1 - 2u n + u"" 1 / 'u n+1 + 2u n + u n - x 
+ A 



St 2 

= -sinw n (9) 

4.3 A Numerical Method for the 2D Navier-Stokes Equation 

Time is discretized using the Crank-Nicolson method, where the nonlinear terms 
are solved for using fixed point iteration 



A ( u n + 1 ' k + 1 + LJ n 



St 

The superscript n denotes the time step and the superscript k denotes the iterate. 
The fixed point iterations stop when 

tolerance > ff (w n+1 ' fc - cu n+1 > k+1 ) 2 dxdy. (11) 



5 Results 

All three equations are solved using different numerical methods so actual compu- 
tation time comparisons between methods are of less interest than common scaling 
trends. For this comparison all of the codes were compiled with the compilers and 
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Table 1: A list of the compiler options used to build the codes. Codes in paren- 
theses use all flags in row. PGI 12.4 requires the flag -Mlarge_arrays when more 
than 2Gb are allocated. 





Compiler 


Flags 


Libraries 






-03 -Mcuda 


CUFFT 


GPU: Cuf and (OpenACC) 


PGI 12.4 


-Minfo (-acc) 
(-ta=nvidia) 


4.1.28 




CUDA 4.0 


-03 


CUFFT 


GPU: C 


VO.2.1221 


-arch=sm_20 


4.1.28 




GFORTRAN 


-03 


FFTW 


CPU: 1 core and (16 core) 


GCC: 4.6.2 


(-fopenmp) 


3.3 



Table 2: Computation times in seconds for 20 time steps of 10 5 for a Fourier 
split step method for the cubic nonlinear Schrodinger equation on [— 5tt, 5ti} 2 . 



Grid 


GPU 


GPU 


GPU 


CPU 


CPU 


Size 


(Cuf) 


(C) 


(OpenACC) 


(16 cores) 


(1 core) 


64 2 


0.00292 


0.00618 


0.00272 


0.0180 


0.0180 


128 2 


0.00366 


0.007189 


0.00459 


0.0130 


0.086 


256 2 


0.00802 


0.0116 


0.0130 


0.148 


0.442 


512 2 


0.0234 


0.0315 


0.0369 


0.562 


1.94 


1024 2 


0.0851 


0.105 


0.132 


2.27 


12.7 


2048 2 


0.334 


0.415 


0.527 


9.67 


57.2 


4096 2 


1.49 


2.02 


2.30 


37.7 


329 


8192 2 


6.30 


N/A 


N/A 


292.4 


1454 



flags shown in Table [T] 

The CPU simulations were run on AMD Opteron Magny-Cours 6136 2.4 GHz 
dual-socket eight-core processors with 48GB of memory and the GPU simula- 
tions on Nvidia Fermi M2070 with 6GB of memory per GPU. Double precision 
floating point arithmetic was used. Computation times are only measured for ad- 
vancing the numerical solution forward in time, they do not include setup time 
for creating FFT plans, allocations, calculating exact solutions, etc. because in 
production simulations where many time steps are taken, these will be negligible. 
Performance differences between FORTRAN compilers on CPUs do not change 
the conclusions. 
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Table 3: Computation times in seconds for 500 time steps of the sine-Gordon 
equation on [— 5n, 5n} 2 . N/A implies that the code could not be run due to memory 
constraints. 



Grid 


GPU 


GPU 


GPU 


CPU 


CPU 


Size 


(Cuf) 


(C) 


(OpenACC) 


(16 cores) 


(1 core) 


64 2 


0.028 


0.025 


0.028 


0.050 


0.098 


128 2 


0.040 


0.031 


0.041 


0.107 


0.401 


256 2 


0.099 


0.065 


0.095 


1.960 


1.899 


512 2 


0.343 


0.260 


0.344 


2.925 


9.301 


1024 2 


1.148 


0.976 


1.218 


19.07 


38.34 


2048 2 


4.485 


4.007 


4.869 


67.84 


165.9 


4096 2 


18.09 


16.53 


19.95 


481.2 


785.5 


8192 2 


85.45 


N/A 


93.51 


934.2 


4556 



5.1 The Cubic Nonlinear Schrodinger Equation 

The programs used 4 double complex arrays, 2 arrays for the actual computa- 
tion and 2 further arrays, to calculate the mass and energy. In all tests, the mass 
was conserved up to machine precision and the energy was constant to within 6 
significant figures. 

Table [2] shows that the GPU gives a speed up of up to a factor of 40 compared 
to a multicore OpenMP CPU implementation. An initial naive CUDA FORTRAN 
implementation for which kernel parallelization options are left to the compiler 
was typically a factor of 2 slower than the CUDA C implementation for which 
block sizes were specified. GPUs had speed ups of order 100 times compared to 
a single CPU core. Figure [T] shows that for this code, CUDA Fortran had the best 
performance. 

5.2 The Sine-Gordon Equation 

The programs used real-to-complex Fourier transforms. They required 2 double 
precision arrays of size N x x N y and 3 double complex arrays of size N x x(N y /2+ 
1). In all tests, the final Hamiltonian was within 6 significant figures. 

Table [3] shows that CUDA C performs the best, followed closely by CUDA 
FORTRAN and OpenACC. The differences between the GPU implementations 
were relatively small compared to the difference between GPU and CPU imple- 
mentations. The GPU implementations performed on the order of 50 times better 
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Table 4: Computation times in seconds for 20 time steps of 1.25 x 1CT 3 for the 
incompressible Navier-Stokes equation on [0, l] 2 . Each simulation generated the 
same max error of 1.65 x 1CT 5 when compared to the exact Taylor-Green solution. 
The tolerance for fixed point iterations was set to 1CT 10 . 



Grid 


GPU 


GPU 


GPU 


CPU 


CPU 


Size 


(Cuf) 


(C) 


(OpenACC) 


(16 Cores) 


(1 core) 


64 2 


0.0151 


0.0164 


0.0171 


.0018 


0.014 


128 2 


0.0201 


0.0266 


0.0204 


0.024 


0.065 


256 2 


0.044 


0.0435 


0.0418 


0.417 


0.37 


512 2 


0.119 


0.141 


0.1205 


0.939 


1.734 


1024 2 


0.357 


0.520 


0.398 


4.89 


7.65 


2048 2 


1.386 


1.88 


1.59 


17.04 


33.56 


4096 2 


5.814 


7.48 


6.79 


112.83 


172.67 



than a single core CPU. Figure [T] shows that for this code, CUDA C had the best 
performance. 

5.3 The 2D Navier-Stokes Equation 

The programs use the same real-to-complex Fourier transforms as the sine-Gordon 
equation, where the real- valued input array has a size of N x x N y and the complex- 
valued output array is of size N x x (N y /2 + 1). 10 arrays are used for the GPU 
codes and 11 arrays are used for CPU codes, where the extra array is required 
because FFTW does not preserve its input. Five complex-real transforms and 1 
real-complex transform are used in the timestepping loop. 

Table [4] shows that the GPU gives a speed up of up to a factor of 30 compared 
to a single CPU core. Figure [T] shows that for this code, OpenACC does better 
than CUDA C, and CUDA FORTRAN has the best performance. 

6 User Comparison of the different GPU program- 
ming environments 

Porting existing FORTRAN codes to CUDA FORTRAN was simple and intuitive; 
the combination of FORTRAN and Open ACC was a little less intuitive, but still 
relatively straightforward. The differences between CPU and GPU versions with 
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Figure 1: Comparison of OpenACC benchmark time to CuF, CUDA C, and CPU 
times for a grid size of 4096 2 (bigger is better). 



Speedup for n=4096 
(bigger is better) 




CUDA FORTRAN CUDAC OpenACC CPU (16 cores) CPU (1 core) 



CUDA FORTRAN or OpenACC are small, so it is feasible to maintain CPU and 
GPU versions of the same code. In contrast, porting a FORTRAN code to CUDA 
C is error prone, requires significant reprogramming and a good understanding of 
the GPU architecture. 

Finally, d observed that the F2C-ACC directive based FORTRAN to CUDA 
compiler results in a runtime code with better performance than regular CUDA 
FORTRAN. The performance of CUDA FORTRAN codes may experience very 
different speedups with minor changes to the parallelization or compiler options. 
Choosing these options optimally requires some knowledge of the underlying ar- 
chitecture - autotuning compilers would be useful for doing this. 
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